Length-Incremental Phrase Training for SMT

Authors

Joern Wuebker

Hermann Ney

Abstract

We present an iterative technique to generate phrase tables for SMT, which is based on force-aligning the training data with a modified translation decoder. In contrast to previous work, we completely avoid the use of a word alignment or phrase extraction heuristics, moving towards a more principled phrase generation and probability estimation. During training, we allow the decoder to generate new phrases on the fly and increment the maximum phrase length in each iteration. Experiments are carried out on the IWSLT 2011 Arabic-English task, where our training method achieves moderate improvements over a state-of-the-art baseline. The resulting phrase table shows only a small overlap with the heuristically extracted one, which demonstrates how restrictive it is to limit phrase selection by a word alignment or heuristics. By interpolating the heuristic and the trained phrase table, we can improve over the baseline by 0.5% BLEU and 0.5% TER.
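
The final step of the abstract, combining the heuristic and the trained phrase table, can be illustrated with a minimal sketch. The abstract does not say how the interpolation is performed, so everything below is an assumption made for illustration: linear interpolation with a single fixed weight, phrase tables stored as plain dictionaries keyed by (source phrase, target phrase), and the hypothetical names `interpolate_phrase_tables` and `weight`.

```python
def interpolate_phrase_tables(heuristic, trained, weight=0.5):
    """Linearly interpolate two phrase tables (illustrative assumption).

    Each table maps (source_phrase, target_phrase) -> probability.
    A pair missing from one table contributes probability 0.0 from
    that table, so phrases found by only one method are still kept.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    merged = {}
    for pair in set(heuristic) | set(trained):
        p_heuristic = heuristic.get(pair, 0.0)
        p_trained = trained.get(pair, 0.0)
        merged[pair] = weight * p_heuristic + (1.0 - weight) * p_trained
    return merged


# Toy tables with made-up entries, not data from the paper.
heuristic_table = {("a", "x"): 1.0}
trained_table = {("a", "x"): 0.5, ("a", "y"): 0.5}
combined = interpolate_phrase_tables(heuristic_table, trained_table, weight=0.5)
```

Because the abstract reports only a small overlap between the two tables, most entries in such a combination would come from just one of them; giving missing pairs a probability of zero is the simplest way to handle that, though other smoothing choices are possible.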

Similar resources

Incrementally Updating the SMT Reordering Model

This work is concerned with incrementally training statistical machine translation (SMT) models when new data becomes available, in contrast to re-training new models on the entire accumulated data. Incremental training provides a way to perform faster, more frequent model updates, keeping the SMT system up to date with the most recent data. Specifically, we address increme...

Lexical Syntax for Statistical Machine Translation

Statistical Machine Translation (SMT) is by far the most dominant paradigm of Machine Translation. This can be justified by many reasons, such as accuracy, scalability, computational efficiency and fast adaptation to new languages and domains. However, current approaches to phrase-based SMT lack the capability to produce more grammatical translations and to handle long-range reordering whil...

Decoder-based Discriminative Training of Phrase Segmentation for Statistical Machine Translation

In this paper, we propose a new method of training a phrase segmentation model for phrase-based statistical machine translation (SMT). We define a good segmentation as one that produces a good translation. According to this definition, we propose a method that can discriminate between a good segmentation and a bad segmentation based on the translation quality. The proposed approach constru...

Incremental Re-training for Post-editing SMT

A method is presented for incremental retraining of an SMT system, in which a local phrase table is created and incrementally updated as a file is translated and post-edited. It is shown that translation data from within the same file has higher value than other domain-specific data. In two technical domains, within-file data increases BLEU score by several full points. Furthermore, a strong re...

Dynamically Integrating Cross-Domain Translation Memory into Phrase-Based Machine Translation during Decoding

Our previous work focused on combining translation memory (TM) and statistical machine translation (SMT) when the TM database and the SMT training set are the same. However, in real-world use the TM database deviates from the SMT training set over time. In this work, we concentrate on the case where the TM database and the SMT training set are different and even from different domains...

Publication date: 2013
